The IPython notebook is an interactive Python shell that runs in the background and communicates with a browser. It supports inline Matplotlib plots, HTML, JavaScript, TeX markup and images.
It is similar to Mathematica: the crucial concept is that you can query data interactively and quickly, with your input and output saved as you go.
I am using WinPython, which is a portable Python environment.
There are others, such as Enthought and Python(x,y).
In [ ]:
# display help
# press Ctrl+M, then H
In [1]:
print 'hello world!'
We can insert cells at any position, remove cells, and copy and paste entire cells.
It supports path completion (on Linux and Mac only), command completion and docstring display, and will open the source code (if available).
It also supports shell commands.
In [2]:
# ping and other shell commands are supported, prefix with !
!ping www.quantel.com
In [3]:
# display docstring
enumerate?
In [4]:
# can also show the source code (if available), use ??
from pandas import *
read_csv??
supports 'magic' functions
In [6]:
def for_loop():
    ret = []
    for x in range(10000):
        ret.append(x)
    return ret
def list_comp():
    return [x for x in range(10000)]
# which is faster?
%timeit for_loop()
%timeit list_comp()
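Outside IPython the same comparison can be made with the standard library timeit module. This is a minimal sketch, not what %timeit does internally; the repeat count of 200 is an arbitrary choice and the timings depend on the machine.

```python
import timeit

def for_loop():
    ret = []
    for x in range(10000):
        ret.append(x)
    return ret

def list_comp():
    return [x for x in range(10000)]

# total time in seconds for 200 calls of each function (200 is an arbitrary choice)
loop_seconds = timeit.timeit(for_loop, number=200)
comp_seconds = timeit.timeit(list_comp, number=200)

print("for loop:           %.4f s" % loop_seconds)
print("list comprehension: %.4f s" % comp_seconds)
```

Both functions build the same list, so only the timings differ.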
In [7]:
# list of all magic functions
%lsmagic
supports inline display of plots
In [8]:
plot(rand(100))
Out[8]:
and images
In [9]:
from IPython.core.display import Image
#Image(filename=r'Desktop\kid_f_off.jpg')
Image(filename=r'Downloads\Happy_Holidays.jpg')
In [10]:
# the above was a deliberate error to demonstrate traceback output
#Image(filename=r'C:\Users\alanwo\Desktop\kid_f_off.jpg')
Image(filename=r'C:\Downloads\Happy_Holidays.jpg')
Out[10]:
display youtube videos too!
In [11]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('tGvHNNOLnCk')
Out[11]:
and boring math
$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$
It also supports mixing Python with other languages such as Cython, R and Octave.
Scripting with Bash, Perl and Ruby is also supported.
Clustering is supported: multiple instances can be launched and the work divided between them.
Pandas provides:
• A set of labeled array data structures, the primary of which are Series/TimeSeries and DataFrame
• Index objects enabling both simple and multi-level axis indexing (lookups, data alignment, and reindexing)
• An integrated group by engine for aggregating and transforming data sets
• Date range generation (date_range)
• Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.
• Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
• Moving window statistics (rolling mean, rolling standard deviation, etc.)
• Static and moving window linear and panel regression
If you can't be bothered to read the above, you can think of Pandas as an in-memory database tool (but it is much more than this).
Some simple examples
In [12]:
from pandas import *
# Series
# s = Series(data, index=index)
# data => a Python dict, an ndarray, a scalar value
# NaN for missing data
sample_series = Series(randn(5))
sample_series
Out[12]:
In [13]:
# we can pass an index list for the rows
sample_series = Series(randn(5), index=['A','B','C','D','E'])
sample_series
Out[13]:
Series can be thought of as 1-d arrays (really a 1-d numpy ndarray underneath)
They are the fundamental datatype in Pandas
In [14]:
# calling .values decomposes to the underlying data
sample_series.values
Out[14]:
In [15]:
# some simple indexing
sample_series[0]
Out[15]:
In [16]:
sample_series[2]
Out[16]:
In [17]:
# a DataFrame can be built from one of:
# Dict of 1D ndarrays, lists, dicts, or Series
# 2-D numpy.ndarray
# Structured or record ndarray
# A Series
# Another DataFrame
sample_df = DataFrame(randn(5))
sample_df
Out[17]:
In [18]:
# again we can pass the column and row names
sample_df = DataFrame({'A':randn(5)}, index=['1', 'c', '5', 'g', 't']) # use arbitrary row labels
sample_df
Out[18]:
In [19]:
# data can consist of anything and can be heterogeneous within a column
from datetime import datetime
sample_df = DataFrame( {'A':['hello', 9, 12312.00, datetime.now(), [x for x in range(5)]]})
sample_df
Out[19]:
In [20]:
# we can construct from a dict of Series
d = {'col1':Series(randn(5), index=['a','b','c','d','e']), 'col2':Series(randn(4), index=['a','b','c','d'])}
d
Out[20]:
In [21]:
sample_df = DataFrame(d)
sample_df
Out[21]:
In [22]:
# column selection
sample_df['col1']
Out[22]:
In [23]:
# this is really a series
type(sample_df['col1'])
Out[23]:
In [24]:
# different to this, which is a dataframe
type(sample_df[['col1']])
Out[24]:
In [25]:
# row selection in dataframes using fancy indexing
sample_df.ix['a']
Out[25]:
In [26]:
# can also use irow, integer based indexing
sample_df.irow(1)
Out[26]:
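The .ix indexer and irow used above are no longer available in recent pandas releases. A minimal sketch of the same two row selections with .loc (by label) and .iloc (by position), assuming a recent pandas and using a small made-up frame df in place of sample_df:

```python
import numpy as np
import pandas as pd

# made-up stand-in for sample_df, with the same row labels
df = pd.DataFrame({'col1': np.arange(5.0), 'col2': np.arange(5.0) * 10},
                  index=['a', 'b', 'c', 'd', 'e'])

row_by_label = df.loc['a']      # label-based, plays the role of sample_df.ix['a']
row_by_position = df.iloc[1]    # position-based, plays the role of sample_df.irow(1)

print(row_by_label)
print(row_by_position)
```

Both calls return a Series whose index is the column names and whose name is the row label.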
In [27]:
# add a column
sample_df['new_col'] = randn(5)
sample_df
Out[27]:
In [28]:
# can compute new values based on other cols
sample_df['diff'] = sample_df['col1'] - sample_df['col2']
sample_df
Out[28]:
In [29]:
from datetime import datetime
sample_df['date'] = datetime.now()
sample_df
Out[29]:
In [30]:
# we can apply functions to a column, row or entire dataframe
sample_df['day'] = sample_df['date'].apply(lambda x: x.day)
sample_df
Out[30]:
In [31]:
# note that NaN values are propagated, we can deal with this in a few ways
# drop them
copy = sample_df.dropna()
copy
Out[31]:
In [32]:
# or replace missing values with a fill value (copy is only another name for sample_df here, so sample_df is modified too)
copy = sample_df
copy.fillna(0, inplace=True)
copy
Out[32]:
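The difference between dropping, filling and aliasing is easy to miss, so here is a minimal sketch on a made-up three-row frame (df and its values are invented for illustration). It assumes a recent pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0],
                   'col2': [10.0, np.nan, 30.0]},
                  index=['a', 'b', 'c'])

dropped = df.dropna()             # new frame without the row that contains NaN
alias = df                        # plain assignment: the same object, not a copy
filled = df.copy()                # an independent copy
filled.fillna(0, inplace=True)    # changes filled only, df keeps its NaN

print(dropped)
print(filled)
print(df)
```

Calling fillna with inplace=True on alias instead would have changed df as well, because both names refer to the same frame.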
In [33]:
# we can select multiple columns
copy = sample_df[['col1', 'col2']]
copy
Out[33]:
In [34]:
# or a cross section
copy = sample_df[['col1', 'col2']][1:4]
copy
# note that for index selection [beg, end) - beg is included, end is excluded
Out[34]:
In [35]:
# we can also do label selection on rows and slice the columns
copy = sample_df.loc['b':, 'col1':'new_col']
copy
Out[35]:
In [36]:
# we can create a boolean mask
sample_df['col1']>0
Out[36]:
In [37]:
# then use this for selection
copy = sample_df[sample_df['col1']>0]
copy
Out[37]:
In [38]:
# can also use more advanced criteria
sample_df.ix[(sample_df['col1'] < 0) & (sample_df['diff']<0)]
# Note: do not use or/and for boolean comparisons; you must use the bitwise versions |/&, and ~ for negation
Out[38]:
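A minimal sketch of the three element-wise operators on a made-up frame (the values are invented for illustration). Each comparison is wrapped in parentheses because & and | bind more tightly than < in Python.

```python
import pandas as pd

df = pd.DataFrame({'col1': [-1.0, 2.0, -3.0, 4.0],
                   'diff': [-0.5, -0.5, 0.5, 0.5]})

both_negative = df[(df['col1'] < 0) & (df['diff'] < 0)]     # element-wise AND
either_negative = df[(df['col1'] < 0) | (df['diff'] < 0)]   # element-wise OR
col1_not_negative = df[~(df['col1'] < 0)]                   # element-wise NOT

print(both_negative)
print(either_negative)
print(col1_not_negative)
```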
In [39]:
# we can get a specific value at row and column
sample_df.loc['c', 'col2']
Out[39]:
In [40]:
# yes fancy reverse works too
sample_df[::-1]
Out[40]:
In [41]:
# we can perform some simple stats
sample_df.describe()
Out[41]:
In [42]:
# do some aggregation like sum, count, cumsum and more
sample_df.sum()
Out[42]:
In [43]:
# plot values
sample_df[['col1', 'col2']].plot()
Out[43]:
OK, let's load some data.
Pandas supports loading data from CSV, HTML (it will load from a URL, detect HTML tables and try to parse them),
JSON, Excel, SQL, text files, the clipboard, HDF and pickles too.
In [44]:
names = read_csv(r'c:\pandas-exercises-master\baby-names2.csv')
names.head()
Out[44]:
In [46]:
names.dtypes
Out[46]:
In [47]:
names.tail()
Out[47]:
In [48]:
names.dtypes
Out[48]:
In [49]:
names.columns
Out[49]:
In [50]:
# names are ordered by popularity, for year 1880 the top names were
names[names['year']==1880].head()
Out[50]:
In [51]:
# and the least popular
names[names['year']==1880].tail()
Out[51]:
In [52]:
# separate the boys from the girls
boys = names[names['sex']=='boy']
girls = names[names['sex']=='girl']
girls
Out[52]:
In [53]:
# perform a groupby operation
boys.groupby('year')
# this has created a groupby object, it has not performed any aggregation or grouping yet
Out[53]:
In [54]:
boys.groupby('year').size()
# shows we have 1000 entries for each year
Out[54]:
In [55]:
# this is the same for both boys and girls
names.groupby(['year','sex']).size()
Out[55]:
In [56]:
# we can index this object too, this selects the group sizes for the year 2000
names.groupby(['year','sex']).size().ix[2000]
Out[56]:
In [57]:
# so let's filter so we can see the most popular names by year
boys[boys.year == 2000] # note that columns are also attributes
Out[57]:
In [58]:
# display the proportion
boys[boys.year==2000].prop
Out[58]:
In [59]:
# display the top 5
boys[boys.year == 2000][:5]
Out[59]:
In [60]:
# we can get the index value for the top proportion value
boys[boys.year == 2000].prop.idxmax()
Out[60]:
In [61]:
# this can be used for indexing
boys.ix[boys[boys.year == 2000].prop.idxmax()] # to get the whole row.
Out[61]:
In [62]:
# using this we can now write a function
def get_max_record(group):
    return group.ix[group.prop.idxmax()]
get_max_record(boys)
Out[62]:
In [63]:
# do this for each year
result = boys.groupby('year').apply(get_max_record)
In [64]:
set_option('display.max_rows', 500)
result
Out[64]:
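The per-year maximum can also be found without a custom function. A minimal sketch, assuming a recent pandas; the frame toy and its values are invented for illustration and are not taken from baby-names2.csv.

```python
import pandas as pd

# invented toy data with the same column names as the boys frame
toy = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'name': ['Jacob', 'Michael', 'Jacob', 'Michael'],
    'prop': [0.016, 0.015, 0.014, 0.0145],
})

counts = toy.groupby('year').size()                  # rows per year
top_labels = toy.groupby('year')['prop'].idxmax()    # year -> row label of the largest prop
top_rows = toy.loc[top_labels]                       # the full rows for those labels

print(counts)
print(top_rows)
```

idxmax returns row labels rather than values, which is why the result can be passed straight to .loc to recover the whole record.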
In [65]:
# plot this, this shows that the proportion required to be the most popular name has fallen generally, there are some notable peaks and troughs though.
# can you think why?
result.prop.plot()
Out[65]:
In [66]:
# let's select a specific name
boys[boys.name == 'Travis']
Out[66]:
In [67]:
# we can set a multi-index on the data frame and use this
idf = boys.set_index(['name', 'year'])
In [68]:
# select the bottom 50
idf[-50:]
Out[68]:
In [69]:
# where's travis?
idf.ix['Travis']
Out[69]:
In [70]:
# plot his popularity
idf.ix['Travis'].prop.plot()
Out[70]:
In [72]:
# calc the mean proportion
boys.groupby('name')['prop'].mean()
Out[72]:
In [73]:
# sort it
boys.groupby('name')['prop'].mean().order()
Out[73]:
In [74]:
# get some stats
boys['prop'].describe()
Out[74]:
In [75]:
# groupby by year, call describe on the prop column
result = boys.groupby('year')['prop'].describe()
In [76]:
# display the first 50
result[:50]
Out[76]:
In [77]:
# let's look at 2008
df = boys[boys.year == 2008]
df.prop
Out[77]:
In [78]:
df = boys[boys.year == 2008].sort_index(by='prop', ascending=False) # sort by prop in descending order; use ascending=True for ascending order
df.prop
Out[78]:
In [79]:
# perform a cumulative summation
df.prop.cumsum() # numpy operation
Out[79]:
In [80]:
# how many names does it take to reach 50%? This is also called a measure of diversity.
df.prop.cumsum().searchsorted(0.5)
Out[80]:
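The cumsum and searchsorted steps can be followed by hand with the standard library. A minimal sketch on four made-up proportions: bisect_left plays the role of searchsorted with its default side='left', and itertools.accumulate plays the role of cumsum.

```python
from bisect import bisect_left
from itertools import accumulate

props = [0.15, 0.4, 0.2, 0.25]            # made-up proportions that sum to 1

ranked = sorted(props, reverse=True)      # most popular first
running = list(accumulate(ranked))        # running total, like cumsum()
position = bisect_left(running, 0.5)      # first index whose running total is >= 0.5
names_needed = position + 1               # positions are zero-based

print(ranked)
print(position, names_needed)
```

The returned position is zero-based, so the number of names needed to cover half of the births is one more than the value searchsorted reports.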
In [81]:
# plot the rank where 50% proportion is
def get_quantile_count(group, quantile = 0.5):
    df = group.sort_index(by='prop', ascending=False)
    return df.prop.cumsum().searchsorted(quantile)
boys.groupby('year').apply(get_quantile_count).plot()
Out[81]:
In [82]:
# do this for boys and girls
def get_quantile_count(group, quantile=0.5): # Problem with no different colors for boys and girls.
    group = group.groupby('soundex').sum()
    df = group.sort_index(by='prop', ascending=False)
    return df.prop.cumsum().searchsorted(quantile)
#f = lambda x: get_quantile_count(x, 0.1)
q = 0.5
boy_ct = boys.groupby('year').apply(get_quantile_count, quantile=q) # to pass different values for quantile
girl_ct = girls.groupby('year').apply(get_quantile_count, quantile=q)
boy_ct.plot(label='boy')
girl_ct.plot(label='girl')
legend(loc='best') # with --pylab=inline, we don't have to do plt.legend()
Out[82]:
In [83]:
# display the ranking
boys[boys.year == 2008].prop.rank() # mean rank by default.
Out[83]:
In [84]:
# create group object
grouped = boys.groupby('year')['prop']
In [85]:
grouped.transform(Series.rank) # transform is more rigid than apply. Output the same size as the input.
Out[85]:
In [86]:
# add this as a column
boys['year_rank'] = grouped.transform(Series.rank)
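A minimal sketch of what transform does here, on invented toy data and assuming a recent pandas. The result has one value per input row, so it can be assigned back as a column. With the default ascending rank, the smallest prop in each year gets rank 1.

```python
import pandas as pd

# invented toy data with the same column names as the boys frame
toy = pd.DataFrame({'year': [2000, 2000, 2000, 2001, 2001],
                    'name': ['A', 'B', 'C', 'A', 'B'],
                    'prop': [0.01, 0.03, 0.02, 0.05, 0.04]})

# rank prop within each year; the output is aligned with the original rows
toy['year_rank'] = toy.groupby('year')['prop'].transform(lambda s: s.rank())

print(toy)
```

Passing ascending=False to rank would give rank 1 to the most popular name instead.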
In [87]:
# how popular is Alan
boys[boys.name == 'Alan'].year_rank.plot()
Out[87]:
In [89]:
# Other names
boys[boys.name == 'Michael'].year_rank.plot()
Out[89]:
In [90]:
# let's load some birth data
births = read_csv(r'c:\pandas-exercises-master\births.csv')
In [91]:
merged = merge(names, births, on=['year', 'sex']) # merge 2 tables, names and births; inner join by default
In [92]:
# calc the number of people born with a given name by multiplying proportion by total number of births
merged['persons'] = np.floor(merged.prop * merged.births)
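A minimal sketch of the merge and the persons calculation on two invented tables, assuming a recent pandas. The column names follow the cells above; the values are made up so the arithmetic is easy to check.

```python
import numpy as np
import pandas as pd

# invented stand-ins for the names and births tables
toy_names = pd.DataFrame({'year': [2000, 2000, 2001],
                          'sex': ['boy', 'girl', 'boy'],
                          'name': ['X', 'Y', 'X'],
                          'prop': [0.25, 0.5, 0.125]})
toy_births = pd.DataFrame({'year': [2000, 2000, 2001],
                           'sex': ['boy', 'girl', 'boy'],
                           'births': [1000, 2000, 3001]})

# inner join on the shared key columns: each name row picks up its births total
toy_merged = pd.merge(toy_names, toy_births, on=['year', 'sex'])
toy_merged['persons'] = np.floor(toy_merged['prop'] * toy_merged['births'])

print(toy_merged)
```

The last row shows why floor is applied: 0.125 * 3001 is 375.125, which is rounded down to a whole number of people.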
In [94]:
merged.head()
Out[94]:
In [95]:
merged.groupby(['name', 'sex'])['persons'].sum() # slice and dice. It's a hierarchical labeling.
Out[95]:
In [96]:
merged.groupby(['name', 'sex'])['persons'].sum().order()
Out[96]:
In [97]:
mboys = merge(boys, births) # inner join by default.
In [98]:
# calc the number of boys born with a given name
mboys['persons'] = np.floor(mboys.prop * mboys.births)
In [99]:
# Select out persons
persons = mboys.set_index(['year', 'name']).persons
In [100]:
persons # hierarchical index
Out[100]:
In [101]:
# Select out all the people named Chris. Plot is kind of crowded. Matplotlib doesn't go more than 130 in x axis.
persons.ix[:, 'Christopher'].plot(kind='bar', rot=90)
Out[101]:
In [102]:
# what about me?
persons.ix[:, 'Alan'].plot(kind='bar', rot=90)
Out[102]:
In [103]:
persons.unstack('name') # Create a data frame whose columns are each unique names, and the row indexes are the years.
Out[103]:
In [104]:
result = _ # underscore in IPython holds the output of the last statement, so we don't have to compute the same thing again
In [105]:
result['Alan']
Out[105]:
In [106]:
result['Alan'].plot()
Out[106]:
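A minimal sketch of the hierarchical index and unstack steps on invented data, assuming a recent pandas. The names toy, toy_persons and wide are made up for this example, and xs is used for the cross-section in place of .ix.

```python
import pandas as pd

toy = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                    'name': ['Alan', 'Chris', 'Alan', 'Chris'],
                    'persons': [10.0, 20.0, 11.0, 19.0]})

# Series with a two-level (year, name) index
toy_persons = toy.set_index(['year', 'name'])['persons']

alan = toy_persons.xs('Alan', level='name')   # all years for one name
wide = toy_persons.unstack('name')            # rows: years, columns: names

print(alan)
print(wide)
```

After unstack, selecting a name is an ordinary column lookup such as wide['Alan'].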
In [107]:
# e.g. convert a column to a plain Python list
sample_df['col2'].tolist()
Out[107]:
In [108]:
# dataframes become a list of lists
sample_df[['col1','col2']].values.tolist()
Out[108]:
In [109]:
# to_dict() returns a dict keyed by column name; each value is a dict mapping row index label to cell value
sample_df.to_dict()
Out[109]:
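A minimal sketch of the three conversions on a tiny made-up frame, to show the shape of each result. It assumes a recent pandas.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['a', 'b'])

column_as_list = df['col2'].tolist()                  # one flat list
rows_as_lists = df[['col1', 'col2']].values.tolist()  # one inner list per row
frame_as_dict = df.to_dict()                          # column -> {row label -> value}

print(column_as_list)
print(rows_as_lists)
print(frame_as_dict)
```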
Some background on De Bruijn graphs: http://en.wikipedia.org/wiki/De_Bruijn_graph
Set up networkx for visualizing the local structure of such a network.
In [110]:
import networkx as nx
from IPython.core.display import display_javascript
#from IPython.frontend.html.notebook import visutils as vis
import json
import time
Load D3.js
In [111]:
%install_ext https://raw.github.com/cschin/ipython_d3_mashup/master/extension/visutils.py
%reload_ext visutils
vis.run_js("$.getScript('http://d3js.org/d3.v2.js')")
vis.run_js("$.getScript('https://raw.github.com/cschin/ipython_d3_mashup/master/extension/vis_extension.js')")
time.sleep(2)
vis.run_js("IPython.vis_init();")
Set up the visualization "cell"/widget.
In [112]:
try:
    vis_display.remove()
except:
    pass
plot_area_style = {"position":"absolute",
                   "top":"0px",
                   "width":"850px",
                   "left":"750px",
                   "height":"350px",
                   "border":"9px groove",
                   "background-color":"rgba(200,200,200,0.5)"}
vis_cell = vis.VISCellWidget(name="plot_area", style = plot_area_style)
## attach the container to a "visual display"
vis_display = vis.NotebookVisualDisplay(container = vis_cell)
## create the SVG element for D3
svg_style = {"width":"850px",
             "height":"300px",
             "border":"2px solid"}
svg = vis.SVGWidget(name = "svg_display",
                    parent = "plot_area",
                    style = svg_style,
                    vis = vis_display)
In [113]:
def replace_nodes(G0, Ns, N0):
    G0.add_node(N0, X="X")
    head = Ns[0]
    tail = Ns[-1]
    p = G0.predecessors(Ns[0])
    if len(p) == 1:
        p = p[0]
        G0.add_edge(p, N0)
    n = G0.successors(Ns[-1])
    if len(n) == 1:
        n = n[0]
        G0.add_edge(N0, n)
    G0.remove_nodes_from(Ns)
def reduce_graph(G0):
    G1 = G0.copy()
    G2 = G0.copy()
    nodes_to_remove = []
    for n in G1.nodes():
        if len(G1.successors(n)) > 1 or len(G1.predecessors(n)) > 1:
            if n[0] != "^" and n[-1] != "$":
                nodes_to_remove.append(n)
    G1.remove_nodes_from(nodes_to_remove)
    for ns in nx.weakly_connected_components(G1):
        ns = [n for n in ns if n[0] != "^" and n[-1] != "$"]
        if len(ns) == 0: continue
        contig = []
        n_sorted = nx.topological_sort(G1, ns)
        n_sorted = [n for n in n_sorted if n[0] != "^" and n[-1] != "$"]
        if len(n_sorted) <= 1: continue
        for kmer in n_sorted:
            assert len(G1.successors(kmer)) <= 1
            if len(contig) == 0:
                contig.append(kmer)
            else:
                contig.append(kmer[-1])
        replace_nodes(G2, n_sorted, "".join(contig))
    return G2
Generate the json for d3.js for the force layout.
In [114]:
def get_group(w, w1):
    c = 0
    for c1, c2 in zip(w,w1):
        if c1 == c2:
            c += 1
            continue
        break
    return c
def set_g_json(seq, k, reduce_g = False):
    G=nx.DiGraph()
    seq = "^"+seq+"$"
    for i in range(len(seq)-k+1):
        w1 = seq[i:i+k-1]
        w2 = seq[i+1:i+k]
        G.add_edge(w1, w2)
    if reduce_g == True:
        G = reduce_graph(G)
    def generateD3JSONForG(G):
        s = {"nodes":[], "links":[]}
        name2Idx = {}
        c = 0
        for n in G.nodes():
            #print n
            g = len(G.neighbors(n))
            if "^" in n:
                s["nodes"].append({"name":n, "group":g, "fixed":True, "x":0,"y":150})
            elif "$" in n:
                s["nodes"].append({"name":n, "group":g, "fixed":True, "x":850,"y":150})
            else:
                s["nodes"].append({"name":n, "group":g})
            name2Idx[n] = c
            c += 1
        for e in G.edges():
            col = "rgb(0,0,255)"
            width = 1
            s["links"].append({"source":name2Idx[e[0]], "target":name2Idx[e[1]], "color":col, "width":width})
        return json.dumps(s)
    n_json = generateD3JSONForG(G)
    vis_cell.set_js_var("n_json", n_json)
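The graph construction loop in set_g_json can be tried on its own without networkx or the widget code. A minimal sketch using only the standard library; the function name de_bruijn_edges and the short test sequence are made up for this example, and a plain dict of sets stands in for nx.DiGraph.

```python
from collections import defaultdict

def de_bruijn_edges(seq, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in seq."""
    seq = "^" + seq + "$"                 # start and end markers, as in set_g_json
    graph = defaultdict(set)
    for i in range(len(seq) - k + 1):
        left = seq[i:i + k - 1]           # first k-1 characters of the k-mer
        right = seq[i + 1:i + k]          # last k-1 characters of the k-mer
        graph[left].add(right)            # one directed edge per k-mer
    return dict(graph)

graph = de_bruijn_edges("ACGTACGT", 4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this example the node CGT has two successors, GTA and GT$, because the repeated ACGT is followed once by more sequence and once by the end marker. Nodes with more than one successor or predecessor are the ones reduce_graph leaves uncollapsed.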
Set up the d3.js code for showing the network around the word START.
In [115]:
vis_display.js_code=[]
set_g_json("ACGTACGTTGTGCAGTAGTAGTAGT",5)
js = """
(function() {
var plot_neighbor=function(json) {
var w = 850,
h = 300,
fill = d3.scale.category10();
var vis = d3.select("#plot_area #svg_display")
var force = d3.layout.force()
.charge(-40)
.linkDistance(2)
.nodes(json.nodes)
.links(json.links)
.size([w, h])
.linkStrength(0.1)
.start();
var link = vis.selectAll("line.link")
.data(json.links)
.enter().append("svg:line")
.attr("class", "link")
.style("stroke-width", function(d) { return d.width; })
.style("stroke", function(d) { return d.color; })
.attr("x1", function(d) { return d.source.x; })
.attr("y1", function(d) { return d.source.y; })
.attr("x2", function(d) { return d.target.x; })
.attr("y2", function(d) { return d.target.y; });
var node = vis.selectAll("circle.node")
.data(json.nodes)
.enter().append("svg:circle")
.attr("class", "node")
.attr("cx", function(d) { return d.x; })
.attr("cy", function(d) { return d.y; })
.attr("r", 4)
.style("fill", function(d) { return fill(d.group); })
.call(force.drag);
node.append("svg:title")
.text(function(d) { return d.name; });
vis.style("opacity", 1e-6)
.transition()
.duration(1000)
.style("opacity", 1);
// Per-type markers, as they don't inherit styles.
vis.append("svg:defs").selectAll("marker")
.data(["arrow"])
.enter().append("svg:marker")
.attr("id", String)
.attr("viewBox", "0 -5 10 10")
.attr("refX", 15)
.attr("refY", -1.5)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("svg:path")
.attr("d", "M0,-5L10,0L0,5");
var path = vis.append("svg:g").selectAll("path")
.data(force.links())
.enter().append("svg:path")
.attr("class", function(d) { return "link arrow"; })
.attr("marker-end", function(d) { return "url(#arrow)"; })
.style("stroke-width", "1.5px");
force.on("tick", function() {
link.attr("x1", function(d) { return d.source.x; })
.attr("y1", function(d) { return d.source.y; })
.attr("x2", function(d) { return d.target.x; })
.attr("y2", function(d) { return d.target.y; });
node.attr("cx", function(d) { return d.x; })
.attr("cy", function(d) { return d.y; });
path.attr("d", function(d) {
var dx = d.target.x - d.source.x,
dy = d.target.y - d.source.y,
dr = Math.sqrt(dx * dx + dy * dy);
return "M" + d.source.x + "," + d.source.y + "L" + d.target.x + "," + d.target.y;
})
});};
var vc = IPython.vis_utils.name_to_viscell["plot_area"];
//alert(vc.data);
var n_json=$.parseJSON(vc.data.n_json);
//var n_json = vc.data.n_json;
//alert(vc.data["n_json"]);
plot_neighbor(n_json)})()
"""
vis_display.attach_js_code(js)
vis_display.refresh()
In [116]:
vis_display.hide()
In [117]:
vis_display.show()
In [118]:
def show_neighbors(w, k, reduce_g = False):
    vis.run_js('$("#svg_display *").remove();')
    set_g_json(w, k, reduce_g = reduce_g)
    for jc in vis_display.js_code:
        vis.run_js(jc)
## create a test input text box
input_style = {"width":"440px"}
tb = vis.InputWidget(name = "input_1",
                     parent = "plot_area",
                     style = input_style,
                     value = "AATTAATTAAGGTTTTAATTATTAATTGTAATTAATTAATTAATACTGAT",
                     vis = vis_display)
def onchange(self, *argv, **kwargv):
    self.update_value()
vis.set_action(tb, "onchange", onchange)
## create a input text box for k
input_style = {"width":"40px"}
kb = vis.InputWidget(name = "input_2",
                     parent = "plot_area",
                     style = input_style,
                     value = "5",
                     vis = vis_display)
vis.set_action(kb, "onchange", onchange)
button_style = {"width":"120px"}
b = vis.ButtonWidget(name="button",
                     parent="plot_area",
                     style=button_style,
                     text="show graph",
                     vis = vis_display)
b.argv = [tb, kb]
def onclick(self, *argv, **kwargv):
    self.text = argv[0].value
    self.k = int(argv[1].value)
    show_neighbors(self.text, self.k, reduce_g = False)
vis.set_action(b, "onclick", onclick, "argv")
button_style = {"width":"120px"}
b_r = vis.ButtonWidget(name="button2",
                       parent="plot_area",
                       style=button_style,
                       text="reduce graph",
                       vis = vis_display)
b_r.argv = [tb, kb]
def onclick(self, *argv, **kwargv):
    self.text = argv[0].value
    self.k = int(argv[1].value)
    show_neighbors(self.text, self.k, reduce_g = True)
vis.set_action(b_r, "onclick", onclick, "argv")
button_style = {"width":"120px"}
b2 = vis.ButtonWidget(name="button3",
                      parent="plot_area",
                      style=button_style,
                      text="close",
                      vis = vis_display)
def onclick(self, *argv, **kwargv):
    vis_display.remove()
vis.set_action(b2, "onclick", onclick)
vis_display.refresh()
Python for Data Analysis - written by the main developer of Pandas